DOCUMENT RESUME

ED 137 365                    95                    TM 006 166

AUTHOR          St. Pierre, Robert G.; Laaner, Rosamund
TITLE           Correcting Covariates for Unreliability: Does It Lead to Differences in an Evaluator's Conclusions?
INSTITUTION     Abt Associates, Inc., Cambridge, Mass.
SPONS AGENCY    Office of Education (DHEW), Washington, D.C.
PUB DATE        [Apr 77]
CONTRACT        300-75-0134
NOTE            27p.; Paper presented at the Annual Meeting of the American Educational Research Association (61st, New York, New York, April 4-8, 1977)
EDRS PRICE      MF-$0.83 HC-$2.06 Plus Postage.
DESCRIPTORS     Achievement Tests; *Analysis of Covariance; Compensatory Education Programs; Early Childhood Education; *Program Evaluation; *Test Reliability; *True Scores
IDENTIFIERS     Project Follow Through



ABSTRACT

One specific correction model suggested by Cohen and Cohen (1975) is applied to data collected in the evaluation of a large-scale quasi-experimental program (Project Follow Through). The paper then examines how different assumptions about test reliability affect the analysis results and the conclusions of the evaluators. The study determines whether the application of reliability or "true score" corrections alters the results of an analysis employing uncorrected covariates in such a fashion as to appreciably change the policy-oriented conclusions of an evaluator. The data on which this paper is based were collected for the 1976 Follow Through evaluation. They include measures on a total of over 5,000 children who began the program in kindergarten (Fall 1971) and completed it in third grade (Spring 1975). Results indicate that applying true-score corrections, using three separate reliability estimates, to covariates employed in analysis of covariance did not change the conclusions of the Follow Through evaluators. (EC)






CORRECTING COVARIATES FOR UNRELIABILITY:
DOES IT LEAD TO DIFFERENCES IN AN EVALUATOR'S
CONCLUSIONS?






BY 

ROBERT ST. PIERRE
AND 

ROSAMUND LMNER 

ABT ASSOCIATES INC.
55 Wheeler Street 
Cambridge, MA 02138 







Presented at the Annual Meetings of the American Educational Research
Association, New York, April 4-8, 1977.



This paper is based on research performed by Abt Associates Inc. under Contract No. 300-75-0134 to the United States Office of Education. Neither Abt Associates staff nor the Office of Education is responsible for any errors or omissions; the authors accept full responsibility for those. The statements in this paper do not necessarily represent any official position of the Office of Education.



Among the many problems prevalent in the evaluation of educational programs are those concerned with the adjustment of outcome scores based on one or more covariates. Typically, evaluations of these programs are implemented in a quasi-experimental fashion, and some version of the analysis of covariance (ANCOVA) is employed in an attempt to statistically equate treatment and comparison groups on one or more pretreatment conditions. However, the application of ANCOVA to quasi-experimental data has been widely criticized. ANCOVA assumes that subjects are randomly assigned to treatment and comparison groups, and violating that assumption leads to systematic bias in the adjusted outcome scores (Campbell and Boruch, 1975). The bias is usually underadjustment when the treatment group is initially disadvantaged with respect to the control group. Achievement tests are commonly used as outcome measures for educational programs. They are also often employed as premeasures and serve as covariates in subsequent analyses. Since such tests are known to contain error, it has been argued that they should be corrected for unreliability prior to entry into a covariance analysis (Lord, 1960).
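A small simulation shows the underadjustment problem. This sketch is not part of the Follow Through analysis. The sample sizes, the one-unit pretest deficit, the noise levels and all function names are hypothetical. The simulated data contain no treatment effect. Even so, an ANCOVA on the fallible pretest credits the initially disadvantaged group with a negative effect. Dividing the pooled within-group slope by the within-group reliability largely removes that bias. Here the reliability is known exactly (.8) because the data are simulated.

```python
import random

def simulate(n_per_group, gap, error_sd, seed=1):
    """Hypothetical data with NO treatment effect: true pretest T,
    observed pretest X = T + error, posttest Y = 2 + T + noise."""
    rng = random.Random(seed)
    groups = {}
    for name, mu in (("treatment", -gap), ("comparison", 0.0)):
        xs, ys = [], []
        for _ in range(n_per_group):
            t = rng.gauss(mu, 1.0)                    # true score, within-group SD 1
            xs.append(t + rng.gauss(0.0, error_sd))   # fallible pretest
            ys.append(2.0 + t + rng.gauss(0.0, 0.5))  # posttest depends on T only
        groups[name] = (xs, ys)
    return groups

def mean(values):
    return sum(values) / len(values)

def pooled_within_slope(groups):
    """Pooled within-group slope of Y on X, i.e. the ANCOVA slope."""
    sxy = sxx = 0.0
    for xs, ys in groups.values():
        mx, my = mean(xs), mean(ys)
        sxy += sum((x - mx) * (y - my) for x, y in zip(xs, ys))
        sxx += sum((x - mx) ** 2 for x in xs)
    return sxy / sxx

def adjusted_difference(groups, reliability=1.0):
    """Treatment minus comparison posttest mean, adjusted to a common pretest.
    A reliability below 1 steepens the slope by a factor of 1/reliability."""
    b = pooled_within_slope(groups) / reliability
    (xt, yt), (xc, yc) = groups["treatment"], groups["comparison"]
    return (mean(yt) - mean(yc)) - b * (mean(xt) - mean(xc))

# Error SD .5 gives a within-group reliability of 1 / (1 + .25) = .8.
data = simulate(n_per_group=20000, gap=1.0, error_sd=0.5)
print(round(adjusted_difference(data), 3))                   # about -0.2
print(round(adjusted_difference(data, reliability=0.8), 3))  # about 0
```

The uncorrected slope is about .8 rather than the true 1.0. Because the treatment group starts one unit lower, the adjustment falls short by about .2 of a point, and that shortfall appears as a spurious negative effect.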

The current debate about the merits of correction for unreliability has raised many methodological questions. For example, which of a wide variety of correction formulas should be used, and which of many available estimates of test reliability is appropriate? This paper does not add to or review the methodological literature. Instead, it applies one specific correction model suggested by Cohen and Cohen (1975) to data collected in the evaluation of a large-scale quasi-experimental program (Project Follow Through). It then examines the effects of different assumptions about test reliability on the analysis results and on the conclusions of the evaluators. The purpose of the study is to determine whether the application of reliability or "true-score" corrections alters the results of an analysis employing uncorrected covariates in such a fashion as to appreciably change the policy-oriented conclusions of an evaluator.

Background

The origins of Follow Through can be traced to an early evaluation of Project Head Start (Wolff and Stein, 1966). That evaluation asserted that the 1965 Head Start experiences had increased the Head Start children's school readiness. The presumed increases were not reflected in the children's achievement test performance at the end of their kindergarten experience in 1965, and this was attributed to the inappropriateness of traditional elementary education. Although some critics viewed this study as raising questions about the value of Head Start, the Johnson administration proposed a Follow Through program which would continue service to disadvantaged children through third grade. Funding problems forced a change in the emphasis of Follow Through, from a full-scale service program to an experimental program in education. In the experimental program, educational specialists (sponsors) sponsored a variety of educational models in groups of school districts (sites). The educational strategies included:
• highly structured projects emphasizing academic skills in reading and arithmetic;
• projects stressing cognitive thinking through asking and answering questions, problem solving, and creative writing;
• projects emphasizing social-emotional development and encouraging exploration and discovery in academic areas; and
• projects focusing on preparing parents to improve the education and development of their children (GAO, 1975, pp. 3-4).

In 1968 the United States Office of Education contracted with the Stanford Research Institute to collect appropriate data as part of a national Follow Through evaluation. Since July 1972, Abt Associates Inc. has been analyzing those data and communicating the results in a series of reports. This paper is based upon work performed for the most recent of those reports (Stebbins, St. Pierre, Proper, Anderson, and Cerva, 1977). The primary question that report addressed was whether the various educational strategies (operationalized through sponsors) being tested in Follow Through had differing impacts on the academic and affective levels of the pupils they served.*



* The data and results reported in this paper are a subset of the data and results included in the report by Stebbins, St. Pierre, Proper, Anderson and Cerva (1977). The interpretations placed on these data are intended to illustrate the way in which corrections for the unreliability of covariates change the conclusions of an evaluator. They are not meant to reflect the interpretations placed on the data by the Abt Associates evaluation team.




Method 

The data on which this paper is based include measures on over 5,000 children. They began their Follow Through experience at entrance to kindergarten in the fall of 1971 and left Follow Through at exit from third grade in the spring of 1975. These pupils were distributed across nine sponsors. Each sponsor implemented its educational program in between five and seven school districts, and each school district contained a Follow Through treatment group (FT) and a non-Follow Through comparison group (NFT).

Sponsor effectiveness was judged in terms of both academic and affective outcomes. At the end of third grade, all children in the evaluation sample were administered four tests:
• the Metropolitan Achievement Tests (Elementary Level);
• the Raven's Progressive Matrices (modified version);
• the Coopersmith Self-Esteem Inventory; and
• the Intellectual Achievement Responsibility Scale.
These four tests contain 11 outcome scores, which were grouped into three outcome domains as indicated in Figure 1.



Figure 1

DOMAINS OF THIRD GRADE TESTING IN FOLLOW THROUGH

Outcome Domain                 Outcome Scores (Test)

BASIC SKILLS                   Word Knowledge, Spelling, Language, Math Computation
                               (Metropolitan Achievement Tests)

COGNITIVE/CONCEPTUAL SKILLS    Reading, Math Concepts, Math Problem Solving
                               (Metropolitan Achievement Tests)
                               Raven's Progressive Matrices

AFFECTIVE OUTCOMES             Coopersmith Self-Esteem Inventory
                               Achievement Responsibility, Positive
                               Achievement Responsibility, Negative
                               (Intellectual Achievement Responsibility Scale)



The Basic Skills are the simplest objectives of traditional elementary schooling: vocabulary, spelling, the conventions of written language, and simple arithmetic computation. Cognitive/Conceptual Skills (comprehension, reading, mathematical concepts, mathematical problems, and abstract problem-solving) are also traditional academic goals, but they are more complex and tend to require application of some basic skills. Affective Outcomes are approximate measures of the children's self-concept and of their tendency to attribute success and failure to themselves rather than to others. In addition, all pupils were administered a pretest, the Wide Range Achievement Test, upon entry to the program. A set of standard student background measures was also collected via parent interviews and school records.

The primary technique for isolating and strengthening the signal of the Follow Through effect from the noise in the data was a statistical adjustment of outcome scores based on preexisting conditions. The set of covariates included:
• the pretest;
• first language (English vs. non-English);
• family income;
• highest occupation in household;
• ethnic membership (two vectors: White vs. other, Black vs. other);
• sex;
• entry age; and
• missing data codes (dummy variables coded 1 if missing and 0 if present) for income and occupation.

In addition to these 10 variables, site-specific (between-site) covariates were coded for each sponsor to adjust for differences among sites. These between-site covariates attempted to control for all nontreatment differences among children related to differences in the sites where the Follow Through experiment was implemented. An analysis of covariance was performed within each Follow Through sponsor for each of the 11 outcome measures. Differences among children related to the 10 covariates were adjusted out of each outcome measure, while differences related to variations among sites within a particular sponsor were simultaneously controlled. The treatment condition was considered to be nested within each site, and an adjusted outcome difference was estimated for each outcome within each site.

As stated earlier, the application of covariance techniques assumes that all covariates are perfectly reliable. However, such reliability cannot necessarily be assumed for each covariate in the present study. Variables such as sex, ethnicity, income, occupation, education, language and age were all presumably measured with minimal error. The pretest posed the most serious problem. Its reliability was estimated on various Follow Through samples by a measure of internal consistency (coefficient alpha) and was on the order of .90.

Although there are several methods for dealing with a single fallible covariate (Porter and Chibucos, 1974), the solution to the problem in the multiple-covariate case is not clear, even if only one of the covariates is unreliable. Cohen and Cohen (1975) offer a method that has not been mathematically proven and which "rests on no more than the judgment of the present authors and some of our colleagues" (Cohen and Cohen, 1975, p. 373). Applying their method to the present case entailed correction only for the effects of unreliability in the pretest. The procedure corrected the correlations of the unreliable covariate with each other covariate and with the outcome for attenuation due to unreliability. Each of these correlations was divided by the square root of the estimated reliability of the covariate. In addition, the covariate standard deviation was corrected by multiplying the observed standard deviation by the square root of the covariate reliability.
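The sketch below shows the arithmetic of this correction. It assumes NumPy and a correlation matrix whose last row and column belong to the outcome. The function name `corrected_weights` and its argument layout are hypothetical. The code follows the verbal description above rather than any published program, and it derives raw-score weights from the standardized solution in the usual way. With a single covariate, r is divided by the square root of the reliability and the covariate SD is multiplied by the same square root. The net effect is to change the slope from b to b divided by the reliability.

```python
import numpy as np

def corrected_weights(R, sds, means, fallible, reliability):
    """Raw-score regression of the last variable on the others after
    correcting one fallible covariate as described in the text.

    R           : correlation matrix; last row/column is the outcome
    sds, means  : standard deviations and means in the same order
    fallible    : index of the unreliable covariate (e.g. the pretest)
    reliability : assumed reliability of that covariate (1.0 = no correction)
    """
    R = np.array(R, dtype=float)
    sds = np.array(sds, dtype=float)
    means = np.array(means, dtype=float)
    root = np.sqrt(reliability)

    # Disattenuate every correlation involving the fallible covariate
    # (with the other covariates and with the outcome); keep its diagonal at 1.
    R[fallible, :] /= root
    R[:, fallible] /= root
    R[fallible, fallible] = 1.0

    # Shrink the observed SD to an estimate of the true-score SD.
    sds[fallible] *= root

    k = R.shape[0] - 1                                # number of covariates
    beta = np.linalg.solve(R[:k, :k], R[:k, k])       # standardized weights
    b = beta * sds[k] / sds[:k]                       # raw-score weights
    a = means[k] - b @ means[:k]                      # intercept
    return a, b

# Single covariate (hypothetical values): r = .6, SD_x = 10, SD_y = 5.
for rel in (1.0, 0.8, 0.6):
    a, b = corrected_weights([[1, .6], [.6, 1]], [10, 5], [30, 20], 0, rel)
    print(rel, round(float(b[0]), 3))   # slope = .3 / rel
```

At low assumed reliabilities the corrected matrix may no longer be a valid correlation matrix, because a corrected correlation can exceed 1. The output should therefore be checked before it is interpreted.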

There is disagreement in the literature as to the most appropriate measure of reliability to employ in such correction methods. Some methodologists recommend internal consistency as the appropriate measure of reliability, and the internal consistency of the pretest was high (.9). However, Campbell and Boruch (1975) suggest that as the time lapse between pretest and posttest increases, the correlation between them decreases. Consequently, they recommend that the pre-post correlation be used as the appropriate measure of reliability.

Given this disagreement, and given that a direct measure of the pre-post correlation for the pretest was not available, the Follow Through data were analyzed using three separate values spanning the range of potential estimates. The reliability values selected were .6, .8, and 1.0; the last is equivalent to not correcting for unreliability.



With the child as the unit of analysis, the analysis estimated a set of raw-score regression weights for each sponsor and outcome using the model

$$Y = a + \sum_{i=1}^{2s+9} b_i x_i$$

where $a$ is a constant, $s$ is the number of sites in a given sponsor, and $b_1, \ldots, b_{2s+9}$ are the regression weights for the predictor variables $x_1, \ldots, x_{2s+9}$. The predictors are defined as follows:

$x_1, \ldots, x_{10}$: the 10 covariates defined earlier

$x_{11}, \ldots, x_{s+9}$: $s - 1$ between-site codes reflecting membership in the sponsor's sites (see Cohen and Cohen, 1975, pp. 171-211, for details on the coding of categorical variables with $s$ distinct levels)

$x_{s+10}, \ldots, x_{2s+9}$: $s$ treatment-within-site codes

With the regression coded in this fashion, the $s$ regression weights $b_{s+10}, \ldots, b_{2s+9}$ are interpretable as adjusted estimates of the FT/NFT outcome differences in the $s$ sites. Thus, a total of 539 within-site estimates of FT effectiveness were calculated: 11 estimates (one for each outcome) for each of the 49 sites (nested within nine sponsors) in the analysis.
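The sketch below makes the coding scheme concrete. It assumes NumPy and dummy (0/1) coding for the between-site vectors. Cohen and Cohen describe several coding schemes, and the text does not say which one was used. The function name `design_matrix`, the simulated sponsor and its site effects are all hypothetical. Each treatment-within-site code is 1 only for FT children in that site. Its least-squares weight therefore estimates the covariate-adjusted FT minus NFT difference in that site.

```python
import numpy as np

def design_matrix(covariates, site, ft, s):
    """Columns: intercept, k covariates, s-1 site dummies (site 0 is the
    reference), and s treatment-within-site codes (1 if FT and in site j)."""
    n = len(site)
    site_dummies = np.zeros((n, s - 1))
    tx_codes = np.zeros((n, s))
    for i in range(n):
        if site[i] > 0:
            site_dummies[i, site[i] - 1] = 1.0
        if ft[i]:
            tx_codes[i, site[i]] = 1.0
    return np.column_stack([np.ones(n), covariates, site_dummies, tx_codes])

# Hypothetical sponsor: s = 5 sites, 200 FT and 200 NFT children per site,
# 10 covariates of which only the first (a "pretest") affects the outcome.
rng = np.random.default_rng(0)
s, per_cell, k = 5, 200, 10
true_effects = np.array([0.5, -0.3, 0.0, 1.0, 0.2])
site = np.repeat(np.arange(s), 2 * per_cell)
ft = np.tile(np.repeat([1, 0], per_cell), s)
covariates = rng.normal(size=(len(site), k))
y = (10 + 0.4 * site + 0.6 * covariates[:, 0]
     + true_effects[site] * ft + rng.normal(size=len(site)))

X = design_matrix(covariates, site, ft, s)
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
site_effects = coef[-s:]                 # weights for the s treatment-within-site codes
print(X.shape[1] - 1, "predictors")      # 2s + 9 = 19
print(np.round(site_effects, 2))
```

With 10 covariates and s = 5 sites, the matrix has 2s + 9 = 19 predictors plus the intercept, which matches the count in the model above.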

Results 

Because of the complexity of the evaluation and the large number of adjusted outcome differences computed, a system was devised to handle the interpretation of these results. Each difference was placed in one of three groups:

• positive treatment effect: the Follow Through group in this site performed better than expected on this outcome, given the performance of a similarly disadvantaged comparison group. An adjusted outcome difference was considered to represent a positive treatment effect if it favored FT, was statistically significant (p < .05), and was greater in absolute magnitude than .25 standard deviation of the raw outcome measure.

• null treatment effect: there was no difference between the performance of the Follow Through and comparison groups on this outcome in this site. An adjusted outcome difference was considered to represent a null treatment effect if it was not a positive or negative treatment effect.

• negative treatment effect: the Follow Through group in this site performed less well than expected on this outcome, given the performance of a similarly disadvantaged comparison group. An adjusted outcome difference was considered to represent a negative treatment effect if it favored NFT, was statistically significant (p < .05), and was greater in absolute magnitude than .25 standard deviation of the raw outcome measure.
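The three-way rule above can be written as a small function. This is a sketch under stated assumptions. The function name `classify_effect` and its arguments are hypothetical. The difference is taken as FT minus NFT, so a positive value favors FT. The p-value is assumed to come from the t-ratio of the corresponding within-site regression weight.

```python
def classify_effect(diff, p_value, outcome_sd, alpha=0.05, min_es=0.25):
    """Classify an adjusted FT - NFT outcome difference.

    diff       : adjusted difference, positive when FT scores higher
    p_value    : significance level of that difference
    outcome_sd : standard deviation of the raw outcome measure
    """
    educationally_large = abs(diff) > min_es * outcome_sd
    if p_value < alpha and educationally_large:
        return "positive" if diff > 0 else "negative"
    return "null"

print(classify_effect(3.0, 0.01, 10.0))   # positive
print(classify_effect(-3.0, 0.01, 10.0))  # negative
print(classify_effect(3.0, 0.20, 10.0))   # null (not significant)
print(classify_effect(2.0, 0.01, 10.0))   # null (only .2 SD)
```

An effect must clear both hurdles, statistical significance and a .25 SD magnitude, to leave the null category. Any change that shrinks standard errors can therefore move effects out of that category even when the differences themselves barely change.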

Tables 1, 2, and 3 summarize, at an aggregate level, the results of the three analyses categorized in the above fashion. They indicate that across all sites, sponsors, and outcomes, lower pretest reliability estimates move treatment effects out of the null category. In the aggregate, correction for unreliability in the pretest tends to make the treatment effects less favorable to Follow Through. Without correction, 463 (86 percent) of the effects are either positive or null. This number drops to 411 (76 percent) when corrected for a .80 reliability estimate and to 388 (72 percent) when corrected for a .60 reliability estimate.
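The aggregate figures quoted above can be recomputed from the cross-tabulations in Tables 1, 2 and 3. This plain-Python sketch uses hypothetical helper names. It takes percent agreement to be the share of the 539 effects on each table's diagonal, which reproduces the printed values of 79 and 88 for Tables 2 and 3. It counts positive-plus-null effects from the table margins.

```python
# Rows: first analysis; columns: second analysis; order: positive, null, negative.
TABLES = {
    "Table 1 (1.0 vs .8)": [[32, 0, 0], [19, 360, 52], [0, 0, 76]],
    "Table 2 (1.0 vs .6)": [[29, 3, 0], [27, 324, 80], [0, 5, 71]],
    "Table 3 (.8 vs .6)": [[44, 7, 0], [12, 314, 34], [0, 11, 117]],
}

def percent_agreement(table):
    """Share of effects classified identically by both analyses."""
    total = sum(sum(row) for row in table)
    return 100.0 * sum(table[i][i] for i in range(3)) / total

def positive_or_null(counts):
    """counts = (positive, null, negative) -> (number, percent)."""
    n = counts[0] + counts[1]
    return n, 100.0 * n / sum(counts)

for name, table in TABLES.items():
    print(name, round(percent_agreement(table)))

# Row margins of Table 1 give the uncorrected analysis; column margins of
# Tables 1 and 2 give the analyses corrected for rel = .8 and rel = .6.
uncorrected = [sum(row) for row in TABLES["Table 1 (1.0 vs .8)"]]
rel_8 = [sum(col) for col in zip(*TABLES["Table 1 (1.0 vs .8)"])]
rel_6 = [sum(col) for col in zip(*TABLES["Table 2 (1.0 vs .6)"])]
for label, counts in (("rel 1.0", uncorrected), ("rel .8", rel_8), ("rel .6", rel_6)):
    n, pct = positive_or_null(counts)
    print(label, n, round(pct))
```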

However, the point of the evaluation was to compare the effectiveness of sponsors, not to search for a Follow Through main effect. To facilitate sponsor comparisons, the treatment effects (classified as positive, null or negative) were aggregated by sponsor (nine sponsors) and by outcome domain (shown earlier in Figure 1). The nine sponsors were each placed in one of three broad groups according to their areas of primary interest (see Figure 2). This categorization is not intended to reflect all the complexities and nuances of each sponsor's program. Readers interested in a description of each sponsor's program are referred to a report by Stebbins, Bo^ and Proper (1977).

Treatment effects were then aggregated by outcome domain within sponsor. Average sponsor treatment effects were calculated by assigning a value of "1" to a positive treatment effect, "0" to a null treatment effect, and "-1" to a negative treatment effect. Figures 3, 4 and 5 present sponsor average treatment effects in each of the three outcome domains for the three different analyses, and Table 4 presents the same data in tabular form.
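The averaging step can be stated as code. In this sketch the helper name `sponsor_average` and the example classifications are hypothetical. Each sponsor-by-domain average is the mean of the +1/0/-1 codes over all of that sponsor's sites and all outcomes in the domain.

```python
SCORE = {"positive": 1, "null": 0, "negative": -1}

def sponsor_average(classifications):
    """Mean of +1/0/-1 codes over one sponsor's site-by-outcome effects
    within a single outcome domain."""
    codes = [SCORE[c] for c in classifications]
    return sum(codes) / len(codes)

# Hypothetical sponsor: 5 sites x 4 Basic Skills outcomes = 20 effects.
effects = ["positive"] * 7 + ["null"] * 12 + ["negative"] * 1
print(round(sponsor_average(effects), 2))   # 0.3
```

An average of .30 means only that positive effects outnumbered negative ones by six in twenty comparisons. The same average can arise from very different mixes of null and non-null effects.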




Table 1

SUMMARY OF CHANGES IN TREATMENT EFFECTS BETWEEN UNCORRECTED
ANCOVA AND ANCOVA WHEN PRETEST IS
CORRECTED USING A RELIABILITY ESTIMATE OF .8

                                  Corrected ANCOVA (rel = .8)
                              positive     null   negative    total
Uncorrected     positive            32        0          0       32
ANCOVA          null                19      360         52      431
(rel = 1.0)     negative             0        0         76       76
                total               51      360        128      539

percent agreement = 87
correlation = .76



Table 2

SUMMARY OF CHANGES IN TREATMENT EFFECTS BETWEEN UNCORRECTED
ANCOVA AND ANCOVA WHEN PRETEST IS
CORRECTED USING A RELIABILITY ESTIMATE OF .6

                                  Corrected ANCOVA (rel = .6)
                              positive     null   negative    total
Uncorrected     positive            29        3          0       32
ANCOVA          null                27      324         80      431
(rel = 1.0)     negative             0        5         71       76
                total               56      332        151      539

percent agreement = 79
correlation = .63





Table 3

SUMMARY OF CHANGES IN TREATMENT EFFECTS
BETWEEN ANCOVA WHEN PRETEST IS CORRECTED
USING A RELIABILITY ESTIMATE OF .8
AND ANCOVA WHEN PRETEST IS CORRECTED
USING A RELIABILITY ESTIMATE OF .6

                                  Corrected ANCOVA (rel = .6)
                              positive     null   negative    total
Corrected       positive            44        7          0       51
ANCOVA          null                12      314         34      360
(rel = .8)      negative             0       11        117      128
                total               56      332        151      539

percent agreement = 88
correlation = .81



Figure 2

FOLLOW THROUGH MODELS: PRIMARY EMPHASIS

PRIMARY EMPHASIS / SPONSOR/MODEL NAME

Basic Skills: These models focus first on the elementary skills of vocabulary, arithmetic computation, spelling, and language.
  • University of Oregon - Direct Instruction Model
  • University of Kansas - Behavior Analysis Approach
  • Southwest Educational Development Laboratory - Language Development Education Approach

Cognitive/Conceptual: These models emphasize the more complex "learning-to-learn" problem-solving skills.
  • University of Florida - Florida Parent Education Model
  • Arizona Center for Early Childhood Education - Tucson Early Education Model
  • High/Scope Educational Research Foundation - Cognitively Oriented Curriculum Model

Affective/Cognitive: These models focus primarily on self-concept and attitudes toward learning, and secondarily on "learning-to-learn" skills.
  • Far West Laboratory for Educational Research and Development - Responsive Education Model
  • Bank Street College of Education - Bank Street College of Education Approach
  • Education Development Center - EDC Open Education Follow Through Program



FIGURE 3: Sponsor Average Treatment Effects in Basic Skills

[Bar chart in three panels: Uncorrected Pretest (rel. = 1.0), Corrected Pretest (rel. = .8) and Corrected Pretest (rel. = .6). Each panel plots the average treatment effect on the horizontal axis for Oregon, SEDL, Kansas, EDC, Far West Labs, Florida, Arizona, High/Scope and Bank Street. B = Basic Skills Model; C = Cognitive/Conceptual Model; A = Affective/Cognitive Model.]



FIGURE 4: Sponsor Average Treatment Effects for Cognitive/Conceptual Skills

[Bar chart in three panels: Uncorrected Pretest (rel. = 1.0), Corrected Pretest (rel. = .8) and Corrected Pretest (rel. = .6). Each panel plots the average treatment effect on the horizontal axis for Far West Labs, SEDL, Oregon, EDC, Florida, Kansas, High/Scope, Arizona and Bank Street. B = Basic Skills Model; C = Cognitive/Conceptual Model; A = Affective/Cognitive Model.]



FIGURE 5: Sponsor Average Treatment Effects in Affective Outcomes

[Bar chart in three panels: Uncorrected Pretest (rel. = 1.0), Corrected Pretest (rel. = .8) and Corrected Pretest (rel. = .6). Each panel plots the average treatment effect on the horizontal axis for Kansas, Florida, SEDL, EDC, High/Scope, Arizona, Oregon, Bank Street and Far West Labs. B = Basic Skills Model; C = Cognitive/Conceptual Model; A = Affective/Cognitive Model.]



Table 4

SPONSOR AVERAGE TREATMENT EFFECTS IN BASIC SKILLS,
COGNITIVE/CONCEPTUAL SKILLS AND
AFFECTIVE OUTCOME AREAS

                          BASIC SKILLS          COGNITIVE/CONCEPTUAL       AFFECTIVE OUTCOMES
Sponsor              rel=1.0  rel=.8  rel=.6   rel=1.0  rel=.8  rel=.6   rel=1.0  rel=.8  rel=.6
Oregon (B)               .30     .25       ?         ?       ?       ?      -.13       ?       ?
Kansas (B)               .00     .00       ?      -.14    -.25       ?       .10    -.10       ?
SEDL (B)                 .00     .00    -.05       .10     .15     .15       .00     .13     .13
Arizona (C)             -.29       ?    -.42      -.25    -.29    -.21      -.11    -.11    -.06
High/Scope (C)          -.30    -.45    -.40         ?    -.25    -.30       .00    -.20    -.20
Florida (C)             -.20    -.15    -.30      -.05    -.15    -.15       .07     .13     .20
Far West Labs (A)       -.13    -.13    -.25         ?     .08     .08      -.17    -.17       ?
Bank Street (A)         -.30    -.50    -.65      -.30    -.50    -.50      -.13    -.40    -.47
EDC (A)                 -.10    -.15    -.15      -.05    -.15     .00       .00     .07     .07

rel = 1.0: uncorrected pretest; rel = .8 and rel = .6: pretest corrected for the stated reliability.
B = Basic Skills Model; C = Cognitive/Conceptual Model; A = Affective/Cognitive Model.
? = value illegible in the source copy.



An examination of these data reveals some interesting overall patterns. First, correction for unreliability in the pretest appears to distort the rank order of the sponsors less with respect to Basic Skills than with respect to Cognitive/Conceptual or Affective outcomes. Second, such corrections tend to produce lower estimates of the absolute level of sponsor effectiveness in all three outcome areas, most markedly in Basic Skills.

At a less global level, the analysis using an uncorrected pretest (rel = 1.0) shows that the models which emphasize Basic Skills do better on tests of these skills than models which emphasize the Cognitive/Conceptual or Affective areas. In particular, the University of Oregon's Direct Instruction Model is clearly more effective in Basic Skills than the rest. Correcting the pretest with an assumed reliability coefficient of .8 does little to alter this interpretation. Oregon still appears to perform best, and the Basic Skills models have higher average treatment effects than other model types. However, sponsors in general have lower estimated levels of effectiveness in Basic Skills when the pretest is corrected (rel = .8). Changing to a reliability estimate of .6 further depresses overall averages but does little to alter the relative standing of sponsors.

An examination of average sponsor treatment effects in Cognitive/Conceptual Skills reveals a somewhat different pattern. When the pretest is not corrected, Far West Labs is the best performer and no single model type appears most effective. Correcting the pretest for unreliability (rel = .8) changes this interpretation slightly: the estimate of SEDL's effectiveness is raised, while Far West Labs becomes less effective. Using a reliability estimate of .6 further alters the relative standing of some sponsors, although none changes more than one or two rank positions. As with Basic Skills outcomes, most sponsors appear less effective in terms of Cognitive/Conceptual Skills when the pretest is corrected.

With respect to the Affective area, Kansas and Florida are the most effective sponsors in the uncorrected analysis (rel = 1.0). Correction of the pretest (rel = .8) dramatically lowers the estimate of effectiveness for Kansas while raising it for Florida, SEDL and EDC. Changing to an estimate of .6 further separates the sponsors. Again, the overall effect of correcting the pretest for assumed unreliability is to lower our estimate of effectiveness for most sponsors.

The application of pretest corrections therefore changes the estimates of both the relative standing of sponsors and the absolute level of sponsor effectiveness, and it does so differently in each outcome area. The changes in interpretation are clearest in Basic Skills. There, the ranking of sponsors is essentially preserved and the overall level of effectiveness is lowered fairly consistently across sponsors. The same pattern is less evident but still present in the Cognitive/Conceptual and Affective areas, where changes in the rank order of sponsors occur more often.

Discussion

As noted in the results section, the primary effect of correcting the pretest for assumed unreliability is to deflate the uncorrected estimates of Follow Through effectiveness while essentially preserving the rank order of sponsors. This pattern is seen more clearly with respect to Basic Skills than other outcomes. The question which now arises is: why did this happen?

Let us first consider what might occur when a covariate is corrected for unreliability in the evaluation of a typical compensatory education program. In such a program we expect the treatment group to have a lower pretest mean than the comparison group. Correction for unreliability in the pretest should therefore make adjusted posttest differences more favorable to the treatment group. Figure 6 shows an example where the treatment group mean score is below that of the comparison group on both the pretest and the posttest. The parallel solid lines represent the regression lines for the treatment and comparison groups in an uncorrected analysis. The dashed lines represent the regression lines for both groups when the pretest has been corrected for unreliability. The regression lines must pass through the means of their samples, and the slope of the regression lines in the corrected analysis is, by definition, steeper than in the uncorrected analysis. The separation of the regression lines, and hence the adjusted mean difference between the treatment and comparison groups, is therefore smaller for the analysis using the corrected pretest ($D_c < D_u$). In other words, correction for unreliability has improved the standing of the treatment group. The effect of this correction depends on the location of the treatment and comparison group means and on the pre-post correlation. Figure 6 represents only a single situation, though one which is likely to occur in the evaluation of compensatory education programs.

[Figure 6: posttest plotted against pretest, showing solid (uncorrected) and dashed (corrected) regression lines for the treatment and comparison groups, with the adjusted differences $D_u$ and $D_c$.]
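The geometry of Figure 6 reduces to one line of arithmetic in the single-covariate case. The corrected slope is the uncorrected pooled within-group slope b divided by the assumed reliability rel. The adjusted difference is therefore D(rel) = (FT posttest mean - NFT posttest mean) - (b / rel) * (FT pretest mean - NFT pretest mean). The sketch below uses hypothetical means and a hypothetical slope of .4, not values from the Follow Through data. It shows that lowering rel helps a treatment group that starts below its comparison group and hurts one that starts above it.

```python
def adjusted_difference(post_ft, post_nft, pre_ft, pre_nft, slope, reliability):
    """FT - NFT posttest difference adjusted to a common pretest, with the
    pooled within-group slope corrected by dividing by the reliability."""
    return (post_ft - post_nft) - (slope / reliability) * (pre_ft - pre_nft)

# Hypothetical sites: FT starts 4 points below NFT, or 2 points above it.
below = dict(post_ft=20.0, post_nft=24.0, pre_ft=26.0, pre_nft=30.0, slope=0.4)
above = dict(post_ft=22.0, post_nft=21.0, pre_ft=30.0, pre_nft=28.0, slope=0.4)

for rel in (1.0, 0.8, 0.6):
    print(rel,
          round(adjusted_difference(reliability=rel, **below), 2),
          round(adjusted_difference(reliability=rel, **above), 2))
```

This single-covariate arithmetic ignores the nine other covariates and the between-site codes in the actual analysis, so it captures only the expected direction of change.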
Let us see whether an examination of pretest means for the Follow Through sponsors allows us to apply the above logic to Follow Through. Table 5 presents descriptive statistics by treatment group within sponsor for the pretest and the four Basic Skills posttests. The treatment group (FT) scores substantially lower than the comparison group (NFT) in only two sponsors, Arizona and Far West Labs. For all other sponsors, the FT group scores above or about the same as the NFT group. This suggests that Arizona and Far West Labs might appear more effective when the analysis is corrected for unreliability in the pretest, while other sponsors would appear less effective or show no change. However, this is not the case. A reexamination of Figures 3, 4 and 5 shows that these two sponsors do not gain in effectiveness in the corrected analyses. In fact, the overall pattern of sponsors appearing less effective in the corrected analyses is very strong, which is counterintuitive. Sponsors whose treatment groups score lower than their comparison groups on the pretest should, it seems, be helped by correction for unreliability. Perhaps some other factor is causing the general drop in program effectiveness.

Table 6 presents three statistics by outcome for the uncorrected and the two corrected analyses. The first is the adjusted outcome differences, which are the regression weights for the FT/NFT within-site contrast (variables $x_{s+10}, \ldots, x_{2s+9}$ in the analytic model presented earlier). The others are the associated standard errors and t-ratios. The data in this table are averages of statistics calculated for each site within each sponsor. There are 49 sites in the nine sponsors, and therefore each number presented in Table 6 is based on 49 site-level pieces of data. Across analyses there is very little change in the average adjusted outcome differences. On the other hand, there is a pronounced reduction in the size of the standard errors of those adjusted differences. The standard error decreases by roughly 30 percent between the uncorrected analysis and the analysis corrected for rel = .8. These two conditions lead to an increase in the






Table 5

DESCRIPTIVE STATISTICS FOR THE PRETEST AND BASIC SKILLS
OUTCOMES BY SPONSOR AND TREATMENT GROUP

                                                 Basic Skills Outcomes
                                    ------------------------------------------------
                                        Word                            Math
                           Pretest    Knowledge   Spelling    Language  Computations
Sponsor       Group    N  Mean    SD  Mean    SD  Mean    SD  Mean    SD  Mean    SD

Oregon        FT     316  29.6  10.3  24.6   9.2  20.2  11.6    --  10.0  22.5   8.7
              NFT    317  30.9  12.9  27.8  11.2  23.0  12.5  19.2   9.6  19.9   7.8
Kansas        FT     585  28.2  11.6  24.0  10.5  19.9  12.6  16.9   8.5  19.2   7.6
              NFT    762  26.2  12.1  22.9  10.4  19.4  12.9  15.8   7.5  16.2   6.8
SEDL          FT     492  26.1  12.8  19.9   9.7  14.0  12.1  15.6   7.2  17.9   7.3
              NFT    563  26.6  11.1  21.4   9.5  17.9  12.6  15.1   6.9  15.6   6.5
Arizona       FT     329  30.6  13.1  24.1  10.8  17.4  11.3  17.6   8.5  18.6   7.8
              NFT    292  35.2  13.4  33.8  11.1  25.4  11.2  22.7  10.0  22.1   8.3
High/Scope    FT     177  28.1  12.0  19.9  10.9  14.8  19.8  13.2   6.2  15.3   6.7
              NFT    337  29.4  12.3  24.9  10.8  12.8  13.0  16.9   8.0  17.6   7.4
Florida       FT     254  27.8  11.7  23.4  10.9  17.1  12.2  15.8   7.1  15.9  15.7
              NFT    481  27.2  12.1  22.2  10.9  18.4  12.8  16.0   8.3   6.0   6.8
Far West Labs FT     241  28.9  12.4  22.7  11.5  15.1  12.5  15.7   7.6  17.3   7.1
              NFT    277  32.2  12.2  27.0  11.3  20.3  12.3  18.7   8.7  18.2   7.1
Bank Street   FT     264  31.3  12.8  23.5  10.9  17.5  12.5  16.9   8.5  16.4   6.8
              NFT    587  28.3  12.0  24.0  10.7  20.8  12.7  16.2   8.3  16.9   7.2
EDC           FT     248  28.7  12.0  21.8  11.4  15.6  12.4  16.4   8.6  17.1   6.4
              NFT    487  29.1  12.6  23.5  11.0  19.6  12.8  16.7   9.0  16.8   7.5






Table 6

AVERAGE ADJUSTED OUTCOME DIFFERENCE,
STANDARD ERROR AND T-RATIO BY OUTCOME
FOR UNCORRECTED AND CORRECTED ANALYSES

                             Uncorrected (rel = 1.0)    Corrected (rel = .8)    Corrected (rel = .6)
                                Adj.    Std.      t-    Adj.    Std.      t-    Adj.    Std.      t-
Domain / Outcome           N   Diff.   Error   Ratio   Diff.   Error   Ratio   Diff.   Error   Ratio

Basic Skills
  Word Knowledge          49   -2.03    2.50    -.93   -2.06    1.74   -1.34   -2.07    1.50   -1.62
  Spelling                49   -2.56    3.08    -.98   -2.61    2.18   -1.41   -2.91    1.96   -1.75
  Language                49    -.69    1.99    -.39    -.73    1.41    -.54      --    1.26    -.69
  Math Computations       49     .38    1.80     .28     .36    1.29     .37     .32    1.19     .35
Cognitive/Conceptual Skills
  Ravens                  49    -.47    1.17    -.46    -.48     .83    -.73    -.51     .77    -.86
  Reading                 49    -.89    1.80    -.55      --    1.25    -.82      --    1.13    -.96
  Math Concepts           49    -.71    1.78    -.42    -.73    1.24    -.63    -.75    1.07    -.80
  Math Problem Solving    49    -.19    1.59    -.18    -.20    1.13    -.30    -.23    1.02    -.38
Affective Outcomes
  Coopersmith             49    -.92    2.08    -.51    -.93    1.53    -.72    -.95    1.50    -.76
  IARS (-)                49    -.06     .77     .01    -.07     .57     .02    -.07     .56    -.00
  IARS (+)                49    -.09     .68    -.17    -.09     .49    -.25    -.10     .48    -.30






size of the t-ratios, which are derived by dividing the adjusted outcome
difference by its associated standard error,* and a corresponding
increase in the number of significant effects. Since Table 6 shows that
the distribution of adjusted outcome differences has a mean less than zero
for all outcomes except math computations, and since these distributions
tend to be positively skewed, the effect of reducing the standard error
is to increase the number of negative effects at a faster rate than the
number of positive effects. Figure 7 gives a representation of what
happens.
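Two points in the paragraph above can be checked with a few lines of arithmetic: each site's t-ratio is its adjusted difference divided by its standard error, and, as the footnote notes, the mean of the site t-ratios need not equal the mean difference divided by the mean standard error. The six sites below are hypothetical values, not data from Table 6. The corrected analysis is mimicked, as an assumption, by leaving the differences unchanged and shrinking each standard error by 30 percent (roughly the reduction reported for rel = .8), and significance is judged against a hypothetical critical |t| of 2.

```python
from statistics import mean

# Hypothetical site-level results for one outcome (not data from the paper).
DIFFS = [-3.0, -2.0, -1.6, -0.5, 0.8, 2.4]  # adjusted FT-NFT differences
SES = [1.4, 0.8, 1.0, 1.2, 1.0, 1.0]       # uncorrected standard errors


def t_ratios(diffs, ses):
    """t-ratio for each site: adjusted difference / its standard error."""
    return [d / s for d, s in zip(diffs, ses)]


def count_significant(diffs, ses, crit=2.0):
    """Return (negative, positive) counts of sites with |t| > crit."""
    ts = t_ratios(diffs, ses)
    return sum(t < -crit for t in ts), sum(t > crit for t in ts)


if __name__ == "__main__":
    avg_t = mean(t_ratios(DIFFS, SES))
    ratio_of_avgs = mean(DIFFS) / mean(SES)
    print(f"mean of site t-ratios: {avg_t:.3f}")
    print(f"mean diff / mean SE:   {ratio_of_avgs:.3f}")

    # Mimic the corrected analysis: same differences, SEs about 30% smaller.
    corrected_ses = [0.7 * s for s in SES]
    print("significant (neg, pos), uncorrected:", count_significant(DIFFS, SES))
    print("significant (neg, pos), corrected:  ", count_significant(DIFFS, corrected_ses))
```

With these made-up values the mean of the t-ratios is about -.58 while the ratio of means is about -.61, and the count of significant negative effects rises from 2 to 3 while the count of significant positive effects stays at 1.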

We have seen that correction of the pretest for assumed unreliability
can lead to changes in the conclusions that an evaluator reaches in terms
of the rank order of sponsors as well as the overall level of program
effectiveness (across sponsors). Such changes were shown to be dependent
on a variety of factors. First, Basic Skills outcomes, which are the
easiest to measure and hence the most reliable, show the fewest changes
in rank order among sponsors, while Affective outcomes, surely the most
difficult to measure and hence the least reliable, show the most changes in
rank order among sponsors. Second, changes in conclusions do not depend
directly on treatment/comparison group pretest differences. The two sponsors
with treatment groups that scored lower than their comparison groups on
the pretest did not particularly benefit from the application of a correction
for unreliability. Third, changes in conclusions depend on the initial
level of program success. To the extent that standard errors are lessened,
it becomes easier to find statistically significant differences between
groups. Fourth, although not investigated in this paper, the existence
of covariates other than the pretest can have an important effect on the
results, since the correlation of each other covariate with the pretest
as well as the pretest/outcome correlation is corrected. Finally, the
appropriateness of the pretest reliability coefficient must be considered.
If the appropriate reliability is on the order of .90, as a coefficient of



* Note that the average t-ratio does not necessarily equal the
average adjusted outcome difference divided by the average standard error.






Figure 7

REPRESENTATION OF THE DISTRIBUTION OF ADJUSTED
OUTCOME DIFFERENCES FOR A GIVEN
OUTCOME MEASURE

[Figure: distribution of adjusted outcome differences, with these regions marked:
    additional significant negative differences when pretest is corrected
    significant negative differences for uncorrected analysis (rel = 1.0)
    additional significant positive differences when pretest is corrected
    significant positive differences for uncorrected analysis (rel = 1.0)]

SE(c) = two standard errors according to corrected analysis
SE(u) = two standard errors according to uncorrected analysis
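Figure 7 can also be expressed numerically. The sketch below assumes, purely for illustration, that the site-level adjusted differences for one outcome follow a normal distribution with a hypothetical negative mean (-2.0) and spread (3.0); the text describes the actual distributions as positively skewed, so this is a simplification. It computes the share of sites lying beyond two standard errors on each side, first with a hypothetical uncorrected standard error of 2.5 and then with one 30 percent smaller.

```python
from statistics import NormalDist


def significant_shares(mu, tau, se, k=2.0):
    """Shares of a N(mu, tau) distribution of site-level adjusted
    differences lying below -k*se and above +k*se."""
    dist = NormalDist(mu, tau)
    negative = dist.cdf(-k * se)
    positive = 1.0 - dist.cdf(k * se)
    return negative, positive


if __name__ == "__main__":
    MU, TAU = -2.0, 3.0          # hypothetical mean and spread of differences
    SE_U, SE_C = 2.5, 0.7 * 2.5  # uncorrected SE and an SE about 30% smaller
    neg_u, pos_u = significant_shares(MU, TAU, SE_U)
    neg_c, pos_c = significant_shares(MU, TAU, SE_C)
    print(f"negative: {neg_u:.3f} -> {neg_c:.3f} (gain {neg_c - neg_u:.3f})")
    print(f"positive: {pos_u:.3f} -> {pos_c:.3f} (gain {pos_c - pos_u:.3f})")
```

Here the negative share grows by about .15 while the positive share grows by about .02. Under the normal assumption this asymmetry holds whenever the mean is below zero: the density at -x then exceeds the density at +x for every x > 0, so the band gained on the negative side always carries more mass than the matching band on the positive side.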






internal consistency shows, correction for unreliability will make very
little difference. On the other hand, the lower estimates of pretest
reliability used in this study lead to increasingly important changes
in conclusions.
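The fourth and final points above can be illustrated with a correlation-matrix form of the correction. The sketch below is a hypothetical illustration in the general spirit of the correction the study applies, not the authors' computation: treating only the pretest as fallible, every correlation involving the pretest (with the outcome and with each other covariate) is divided by the square root of an assumed reliability, and the standardized regression weights are then re-solved. All correlations are made up, and the three predictors (pretest, one other covariate, and the FT/NFT contrast) are assumptions for illustration.

```python
from math import sqrt

# Hypothetical correlations (not from the paper). Predictors: pretest P,
# one other covariate A, and the FT/NFT treatment contrast T; outcome Y.
R_XX = [
    [1.0, 0.3, -0.1],   # P with P, A, T
    [0.3, 1.0, 0.0],    # A with P, A, T
    [-0.1, 0.0, 1.0],   # T with P, A, T
]
R_XY = [0.5, 0.2, -0.05]  # P, A, T with outcome Y
PRETEST = 0               # index of the pretest among the predictors


def correct_for_pretest_reliability(r_xx, r_xy, rel, idx=PRETEST):
    """Divide every correlation involving the pretest by sqrt(rel).

    Only the pretest is treated as fallible; the diagonal stays 1.
    """
    k = sqrt(rel)
    cxx = [row[:] for row in r_xx]
    for j in range(len(r_xy)):
        if j != idx:
            cxx[idx][j] = cxx[j][idx] = r_xx[idx][j] / k
    cxy = r_xy[:]
    cxy[idx] = r_xy[idx] / k
    if any(abs(v) > 1 for row in cxx for v in row) or any(abs(v) > 1 for v in cxy):
        raise ValueError("corrected correlation exceeds 1; reliability too low")
    return cxx, cxy


def solve(a, b):
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (m[r][n] - sum(m[r][c] * x[c] for c in range(r + 1, n))) / m[r][r]
    return x


if __name__ == "__main__":
    for rel in (1.0, 0.9, 0.8, 0.6):
        betas = solve(*correct_for_pretest_reliability(R_XX, R_XY, rel))
        print(f"rel={rel:.1f}  beta_pretest={betas[0]:.3f}  beta_treatment={betas[2]:.4f}")
```

With these made-up correlations the standardized treatment weight moves from about -.002 at rel = 1.0 to about .004 at rel = .9, .012 at rel = .8 and .037 at rel = .6: a reliability near .90 barely moves the result, while lower assumed reliabilities move it progressively more, and the pretest's correlation with the other covariate is corrected along the way.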






BIBLIOGRAPHY

Campbell, D. T. and Boruch, R. F. Making the case for randomized
assignment to treatments by considering the alternatives.
In C. A. Bennett and A. A. Lumsdaine (Eds.), Evaluation and
experiment. New York: Academic Press, 1975.

Cohen, J. and Cohen, P. Applied multiple regression/correlation
analysis for the behavioral sciences. Hillsdale, N.J.:
Lawrence Erlbaum Associates, 1975.

GAO. Follow Through: Lessons learned from its evaluation and the
need to improve its administration (Report to Congress,
MWD-75-34). Washington, D.C.: GAO, 1975.

Lord, F. M. Large-sample covariance analysis when the control variable
is fallible. Journal of the American Statistical Association,
1960, 55, 307-321.

Porter, A. C., and Chibucos, T. R. Selecting analysis strategies.
In G. D. Borich (Ed.), Evaluating educational programs and
products. Englewood Cliffs, N.J.: Educational Technology
Publications, 1974.

Stebbins, L., Bock, G., and Proper, E. C. Education as experimentation:
A planned variation model. Volume IV-B. Cambridge,
Massachusetts: Abt Associates, Inc., 1977.

Stebbins, L. B., St. Pierre, R. G., Proper, E. C., Anderson, R. B., and
Cerva, T. R. Education as experimentation: A planned variation
model. Volume IV-A. Cambridge, Massachusetts: Abt Associates,
Inc., 1977.

Wolff, M. and Stein, A. Six months later. Head Start evaluation
project. New York: Yeshiva University, Ferkauf Graduate
School of Education, 1966.






